Existing monocular depth estimation methods often rely on image semantic information to obtain depth and ignore another important cue: defocus blur. Meanwhile, depth estimation methods based on defocus blur usually take the focal stack or gradient information as input, but do not account for two characteristics: the small variation in blur between image layers of the focal stack, and the blur ambiguity on both sides of the focal plane. To address these deficiencies of existing focal stack depth estimation methods, a lightweight network based on three-dimensional (3D) convolution is proposed. First, a 3D perception module is designed to coarsely extract the blur information of the focal stack. Second, the extracted information is concatenated with the difference features of the focal stack's RGB channels, output by a channel difference module, to construct a focus volume that can identify blur ambiguity patterns. Finally, multi-scale 3D convolution is used to predict depth. Experimental results show that, compared with methods such as All in Focus Depth Network (AiFDepthNet), the proposed method achieves the best results on seven metrics, including Mean Absolute Error (MAE), on the DefocusNet dataset, and the best results on four metrics and the second-best results on three metrics on the NYU Depth V2 dataset. In addition, the lightweight design reduces the inference time of the proposed method by 43.92% to 70.20% and by 47.91% to 77.01% on the two datasets, respectively. These results verify that the proposed method effectively improves both the accuracy and the inference speed of focal stack depth estimation.
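
The blur ambiguity on both sides of the focal plane can be illustrated with a small numerical sketch. It assumes the standard thin-lens circle-of-confusion model, which is not taken from the proposed method; the function names and the lens parameters (50 mm focal length, 25 mm aperture, focus at 2 m) are hypothetical and chosen only for illustration. The sketch shows that a point in front of the focal plane and a point behind it can produce the same blur diameter in one focal slice, while a second focal slice gives them different blur.

```python
import math


def coc_diameter(depth, focus_dist, focal_len, aperture):
    """Thin-lens circle-of-confusion diameter (all lengths in metres).

    Assumed model: c = A * f * |d - s| / (d * (s - f)), where d is the
    object depth, s the focus distance, f the focal length and A the
    aperture diameter.
    """
    return aperture * focal_len * abs(depth - focus_dist) / (
        depth * (focus_dist - focal_len)
    )


def far_depth_with_same_blur(near_depth, focus_dist):
    """Depth behind the focal plane whose blur equals that of near_depth.

    Solves (s - d1) / d1 = (d2 - s) / d2 for d2. A finite solution exists
    only when focus_dist / 2 < near_depth < focus_dist.
    """
    if not focus_dist / 2 < near_depth < focus_dist:
        raise ValueError("near_depth must lie in (focus_dist / 2, focus_dist)")
    return focus_dist * near_depth / (2 * near_depth - focus_dist)


# Hypothetical lens settings, for illustration only.
FOCAL_LEN = 0.05   # 50 mm lens
APERTURE = 0.025   # 25 mm aperture diameter (f/2)
FOCUS = 2.0        # this focal slice is focused at 2 m

near = 1.5
far = far_depth_with_same_blur(near, FOCUS)

blur_near = coc_diameter(near, FOCUS, FOCAL_LEN, APERTURE)
blur_far = coc_diameter(far, FOCUS, FOCAL_LEN, APERTURE)

# A second slice of the focal stack, focused at 1 m, separates the two depths.
blur_near_slice2 = coc_diameter(near, 1.0, FOCAL_LEN, APERTURE)
blur_far_slice2 = coc_diameter(far, 1.0, FOCAL_LEN, APERTURE)

print(f"far depth with matching blur: {far} m")
print(f"blur at focus 2 m: near={blur_near:.6e}, far={blur_far:.6e}")
print(f"blur at focus 1 m: near={blur_near_slice2:.6e}, far={blur_far_slice2:.6e}")
```

Under these assumptions, blur magnitude in a single slice does not reveal which side of the focal plane a point lies on, which is why a focus volume built from the whole stack needs features that can distinguish the two cases.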